Spin glass models for a network of real neurons 
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Ising models with pairwise interactions are the least structured, or maximum-entropy, probability 
distributions that exactly reproduce measured pairwise correlations between spins. Here we use this 
equivalence to construct Ising models that describe the correlated spiking activity of populations 
of 40 neurons in the salamander retina responding to natural movies. We show that pairwise 
interactions between neurons account for observed higher-order correlations, and that for groups 
of 10 or more neurons pairwise interactions can no longer be regarded as small perturbations in an 
independent system. We then construct network ensembles that generalize the network instances 
observed in the experiment, and study their thermodynamic behavior and coding capacity. Based 
on this construction, we can also create synthetic networks of 120 neurons, and find that with 
increasing size the networks operate closer to a critical point and start exhibiting collective behaviors 
reminiscent of spin glasses. We examine closely two such behaviors that could be relevant for neural 
code: tuning of the network to the critical point to maximize the ability to encode diverse stimuli, 
and using the metastable states of the Ising Hamiltonian as neural code words. 
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I. INTRODUCTION 



Physicists have long explored analogies between the 
statistical mechanics of Ising models and the dynamics 
of neural networks [TJ [H |3] • The goal of this effort has 
been not to simulate the details of particular networks, 
but to understand how interesting functions can emerge, 
collectively, from large populations of neurons. In the 
spirit of modern statistical mechanics, one hopes that 
these collective behaviors will have some degree of uni- 
versality, and hence that one can make progress without 
knowing all of the microscopic details of each system. 
A classic example of this work is the model of associa- 
tive or content-addressable memory due to Hopficld [1J, 
which is able to recover the correct memory from any of 
its subparts of sufficient size. Because the computational 
substrate of neural states in these models were binary 
'spins,' and the memories were realized as locally stable 
states of the network dynamics, methods of statistical 
physics could be brought to bear on theoretically chal- 
lenging issues such as the storage capacity of the network 
or its reliability in the presence of noise On the 
other hand, precisely because of these abstractions, it has 
not always been clear how to bring the predictions of the 
models into contact with experiment. 

Recently it has been suggested that the analogy be- 
tween Ising models and neural networks can be turned 
into a precise mapping, and connected to experimental 
data, using the maximum entropy framework [4 j. We 



imagine a neural system exposed to a stationary stim- 
ulus ensemble, in which simultaneous recordings of N 
neurons can be made. In small windows of time, a single 
neuron i either does (<7i = +1) or does not (o"i = — 1) 
generate an action potential or "spike" [5]; the state of 
the entire network in that time bin thus is described by a 
'binary word' or spin configuration {<7j}. As the system 
responds to its inputs, it visits each of these states with 
some probability P OX pt({o'i})- Even before we ask what 
the different states 'mean,' for example as code words 
in a representation of the sensory world, specifying this 
distribution requires us to determine the probability of 
each of 2 N possible states. In practice, once N increases 
beyond ~ 10, this becomes impossible. The idea of the 
maximum entropy construction is to measure low order 
moments of the distribution, such as the average proba- 
bility of spiking for each cell ((en)) and the correlations 
between pairs of cells (Cy = (<7i<7j) — (<7i)(c>j)), where 
the averages are taken over the course of the experiment, 
and then search for a probability distribution P({<7i}) 
that matches these experimental measurements but oth- 
erwise is as unstructured as possible. Minimizing struc- 
ture means maximizing entropy [6 , and given any set 
of moments or correlations that we want to match, the 
form of the maximum entropy distribution is easy to find 
analytically. 

For the particular case we are interested in, where the 
states of neurons are given by binary variables o\ and we 
match the mean spike probabilities and pairwise correla- 
tions, the maximum entropy distribution is 
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where the "magnetic fields" {hi} and the "exchange cou- 
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plings" {Jy} have to be set to reproduce the measured 
values of {(<Xi)} and {Cy}. This is exactly the Ising model 
with pairwise interactions among the spins. Thus, the 
maximum entropy construction derives the Ising model 
from real data, rather than postulating it as an approxi- 
mation to the underlying dynamics. The construction is 
not an analogy or metaphor, but an exact mapping — Eq 
should predict, given the measured correlations be- 
tween pairs of neurons, the probability of all states for the 
whole network of N neurons. Further, this is a minimal 
model, in that the real network can have more structure 
than predicted by the maximum entropy model, but not 
less. 

Conceptually, Jy describes the direct mutual interac- 
tion between neurons i and j that remains after the contri- 
butions from other interactions in the network through 
more circuitous paths have been disentangled from the 
corresponding correlation Cy , and hi represents the neu- 
ron's intrinsic bias towards firing or silence |U [7] . This 
construction thus takes us from the experimentally acces- 
sible correlation functions Cy and (ai) to the underlying 
Hamiltonian TL, which in turn determines the probabil- 
ity of every binary word {a{\ in the neural codebook. 
This path is the inverse of the usual problem in statis- 
tical physics, where we take the fields and couplings to 
be known (or chosen from a known distribution) and we 
must calculate the observable correlation functions. 

The surprising result of Ref [1] is that the pairwise 
Ising model provides a very accurate description of the 
combinatorial patterns of spiking and silence, {ct;}, in 
ganglion cells of the salamander retina as they respond 
to natural and artificial movies, and in cortical cell cul- 
tures. In other words, the frequency of appearance of all 
binary patterns across TV neurons can be explained by 
the interactions between pairs of neurons; consequently, 
Eq (JlJ only requires 0(N 2 ) parameters instead of the 
original 2 N to fully specify the distribution. After the 
initial success in the salamander retina, similarly encour- 
aging results were obtained in the primate retina, under 
very different stimulus conditions [9], in visual cor- 
tex [TUJ [TT] , and in networks grown in vitro [TJ] . Most 
of these detailed comparisons of theory and experiment 
were done for groups of N ~ 10 neurons, small enough 
that the full distribution P e xpt({c r i}) could be sampled 
experimentally and used to assess the quality of the pair- 
wise maximum entropy model of Eq (JT|) . 

We report here on our efforts to move towards larger 
networks of neurons and explore the kinds of collective ef- 
fects that might be expected in networks of a few hundred 
elements in size. Since making the first version of our 
results available [13], this work has stimulated research 
into the tractability and approximate methods for solv- 
ing the maximum entropy problem [T3J [131 HS1 El HH1 US] 
and generalizations of the pairwise model [2T)ll2Tll2"2"ll23| . 
There is also growing interest in the use of maximum en- 
tropy methods to describe a wider range of biological sys- 
tems, from protein structure to gene regulatory networks 
to ecosystems |H HS1 HI1 [551 [23] • Here we present 



a more detailed exposition, arguing that our findings, in 
addition to providing a good fit to the data, can help us 
understand how networks of neurons function. 

In detail, we extend the maximum entropy results to 
N = 40 neurons using Monte Carlo methods. To moti- 
vate the discussion, we start by illustrating how strong 
collective effects can emerge in a toy mean-field Ising 
model used in the context of an inverse problem. We then 
extract realistic Ising models from the data; by studying 
subnetworks drawn from the full 40-neuron network, we 
first present an argument for why pairwise models can be 
successful in accounting for neural data, at least for small 
enough N. We then argue that the observed subnetworks 
are typical of an ensemble, for which we provide an ex- 
plicit construction and out of which we can draw larger, 
synthetic networks. Remarkably, these larger networks 
seem to be poised very close to a critical point and ex- 
hibit a rich vocabulary of locally stable states, allowing 
us to hypothesize about possible combinatorial coding 
mechanisms and, as a result, predict two interesting col- 
lective behaviors of neural networks that should become 
visible in the next generation of experiments ESI EU 121] • 



II. DISTRIBUTIONS OF OUTPUT WORDS 

As a concrete example, we consider a population of 
retinal ganglion cells — the neurons which provide the 
brain with all of its input about the visual world — 
responding to naturalistic movies. In such experiments, 
it is conventional to ask for a model which can pre- 
dict the response of the neurons to arbitrary stimuli, 
P(Wi} I stimulus). In the natural setting, stimuli are 
drawn from a space of very high dimensionality, and so 
constructing this map from stimuli to responses is very 
challenging [331 1311 EH ES] . Alternatively, we can ask for 
a 'dictionary' that describes the stimuli consistent with 
particular patterns of activity, P(stimulus|{cri}) [5]. Here 
we take a very different approach, largely ignoring the vi- 
sual stimulus and trying to understand the distribution 
of responses, P({cti}). Before proceeding, we explain why 
this seemingly more limited problem is of interest. 

Even when we measure the correlation between two 
neurons, Cy, the usual approach is to dissect the corre- 
lation into contributions which are intrinsic to the net- 
work and those which can be ascribed to common, stim- 
ulus driven inputs. The idea of decomposing correlations 
dates back to a time when it was hoped that correla- 
tions among spikes could be used to map the synaptic 
connections between neurons [37]. In fact, in a highly in- 
terconnected system, the dominant source of correlations 
between two neurons — even if they are entirely intrinsic 
to the network — will always be through the multitude 
of indirect paths involving other neurons |38j . On the 
other hand, for neurons in the early stages of sensory 
processing, it is often suggested that the responses are 
"conditionally independent" given the stimulus, that is 
P({ fi}| stimulus) = Yli -Pi ((Ti| stimulus) EH]. In the par- 
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ticular case considered here, we know that, for popula- 
tions of N — 10 neurons, this model already fails dra- 
matically [1]. 

The question of whether correlations are driven by the 
stimulus or are intrinsic to the network is not a question 
that the brain can answer. We, as external observers, can 
repeat the stimulus exactly, and search for correlations 
conditional on the stimulus, but this is not accessible to 
the organism. The brain has access only to the output 
of the retina, patterns of activity which are drawn from 
the distribution P({<7i}). If the responses {c;} are code- 
words for the visual stimulus, then the entropy of this 
distribution sets the capacity of the code to carry infor- 
mation. Word by word, — logP({t7i}) determines how 
surprised the brain should be by each particular pattern 
of response, including the possibility that the response 
was corrupted in transmission and thus should be cor- 
rected or ignored [40] . In a very real sense, what the 
brain 'sees' are sequences of states drawn from the dis- 
tribution P({<7i}). In the same spirit that many groups 
have studied the statistical structures of natural scenes 
[U EH E2 IH Hi H], we would like to understand 
the statistical structure of the codewords that represent 
these scenes. We shall see that this structure has features 
which are suggestive of new ideas about the nature of the 
retinal code. 



III. A SIMPLE EXAMPLE 
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Since all neurons and pairs are equivalent , the maximum 
entropy model consistent with pairwise correlations has 
the simpler form, 



1 



Z(h,J) 



exp 



N 



J 



N 



(4) 



which is just the mean field ferromagnet (assuming that 
J is positive, and that the temperature ksT, which here 
does not have a direct physical interpretation, has been 
set to 1). The solution of the mean field model is well 
known |46j : here we recall a few details which will be 
especially important in the context of neurons. 

As usual, everything we need to know is encoded in 
the partition function, which we evaluate by introducing 
an auxiliary field (f>: 
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We would like to illustrate these ideas with a simple 
example. Imagine that we record from N neurons, and 
we find that all of them have the same mean rate of spik- 
ing, f. Further, if we look at any pair of neurons, the 
probability of both spiking in the same small window of 
time is a bit larger than expected if they were indepen- 
dent. To be precise, we choose time windows of duration 
At, so the probability of a spike is q = f At and the 
probability of coincident spikes from two particular cells 
is p c = q 2 (l + e). We want to describe this network 
as above, with Ising variables o\ = +1 for spiking and 
<j\ = — 1 for silence. Then we have 



= -1 + 2g, 
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where the effective free energy is 
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If N is large and we hold NJ finite, then the integral in 
Eq ^ is dominated by the mean field or classical value 
for 0, defined by the minimum of F^, 



(f, c = NJtanh{h + 4> c ). 
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Evaluating Z to first order in the fluctuations around <j) c , 
we have 
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where • • • refers to higher order terms in 1/N, again assuming that NJ stays finite as N becomes large. 
To fix the values of h and J, we need to solve for the expectation values (cr;) and (<Ti<7j): 

1 dlnZ(h,J) 1 NJtjl - 1 2 ) 
= (^i) = — 1 + — ^7 — 77T7: 
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(ii) 
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where t = tanh(/i+0 c ). Rearranging and retaining terms 
to order 1/iV or larger, we have 
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We recall that we usually require NJ to scale as N J ~ 
0(1) at large N, because we want to have a well de- 
fined thermodynamic limit, in which energy and entropy 
are extensive quantities. In particular, with NJ = Jq, 
we can write the effective free energy in Eq (|7| as 
Fiq{(j>\ h, J) — F((f>; h, J ), where F has no explicit N de- 
pendence. Another key (and familiar) point is that, with 
the usual scaling of NJ constant as N becomes large, 
the correlations behave as e oc 1/N [Eq ( 13 1], and in the 
N — > oo limit, they must decay to zero. 

To connect the Ising models to the data, we face a new 
problem: we do not have direct access to the parameters 
J and h but rather have to infer them from measured 
q and e. As we examine recordings of increasingly large 
subsets of neurons taken from a dense patch of the retina, 
we could find that instead of decaying to 0, the average 
pairwise correlation e stays constant. This would mean 
that NJ cannot be constant as we consider larger groups 
of neurons, and that hence increasingly large subsets of 
neurons do not comprise an ensemble with a conventional 
thermodynamic limit. 

It is easy to find an example of such a behavior and 
gain intuition if we limit ourselves to a (somewhat un- 
realistic) case when h = 0. Then, Eq ^ for the mean 
field value of cj> c has either 1 or 3 solutions, depending 
on the value of J - the solution cf> c — corresponds to 
the paramagnetic phase, and the non-zero solutions cor- 
respond to spontaneous magnetization. The transition 
from 1 to 3 solutions is the phase transition in the Ising 
ferromagnet, and it occurs at the critical value Jq = 1 in 
the thermodynamic limit. 

Suppose that we are in the paramagnetic phase, with 
the solution 4> c = 0, t = and J < 1. Equation (10) 
then tells us that (rri) = 0, or q = 0.5. If the average 
correlation is found experimentally to be e, then we can 
compute how Jo 



NJ must scale from Eq ( 13 1: 
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As we assumed, for every finite N, Jq will stay below 
the critical value of 1 and thus choosing the solution with 
t = is self-consistent. We also note, however, that 
as N increases, the system approaches the critical value 
Jq = 1, regardless of the value of e, as long as that does 
not decrease faster than 1/N. To detect such an onset of 
criticality, one could perturb the coupling constans h and 
J by introducing a fictitious temperature T, perturbing 



it around the value 1, and observing the emergence of 
the peak in heat capacity C(T) = TdS/dT at T = 1. 

In the real data the mean firing rates and the corre- 
lations are not homogenous across the population, and 
the ((Tj) = assumption clearly is not valid, because 
the neurons spike infrequently, making q <C 1, so that 
our toy model cannot be taken too literarily. Nonethe- 
less, this exercise teaches us that one must be careful in 
transferring our intuition about thermodynamic limits to 
the case of maximum entropy models for real networks. 
It is not just that real networks have finite N; indeed, 
many networks probably have N sufficiently large that 
some sort of large N approximation is valid. The prob- 
lem is rather that, since the correlations among pairs of 
neurons are experimentally measurable, we are not free 
to let these vary as we imagine recording from more and 
more neurons. More precisely, having a non-zero correla- 
tion between all pairs in a large, homogeneous network is 
inconsistent with the large N limit except at the critical 
point. This prepares us for the possibility that, even in a 
more realistic, inhomogcncous system, the common ob- 
servation of nontrivial correlations among most pairs of 
neurons points toward an interesting regime of operation. 



IV. LEARNING MAXIMUM ENTROPY 
MODELS FROM REAL DATA 

We recall that maximum entropy models are the least 
structured, or most random, models consistent with 
known expectation values [51 [7] . The Ising model of Eq 
([lj thus is the minimal model forced upon us by mea- 
surements of mean spike probabilities and pairwise cor- 
relations. In maximum entropy models that match only 
the measured mean spike probability but not pairwise 
correlations (the so-called independent models), the cou- 
plings J;j are zero, and the probability distribution fac- 
tors into independent contributions from each neuron. 
As has been shown elsewhere, and as we also reiterate 
later in this paper, independent models completely fail 
to account for the data, and the pairwise extension is 
therefore the next minimal step that we are required to 
take. In addition, the pairwise model also is a minimal 
model in which we can expect to observe any interesting 
collective behaviors. 

To be concrete, we consider the salamander retina re- 
sponding to naturalistic movie clips, as in the experi- 
ments of Refs [H H7] . The visual stimulus consists of a 
26.2 s movie that was projected onto the retina 145 times 
in succession, for a total of roughly one hour of stable 
recordings. Using bins of At = 20 ms along the time 
axis yields 1310 samples per movie repeat, for a total of 
189950 binary word samples, where in each time bin each 
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system, and we consider a class of models 
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FIG. 1: Expectation values and interactions. At left, the 
expectation values computed from the experimental data; at 
right, the corresponding terms in the effective Hamiltonian of 
the maximum entropy model. Neurons are numbered in order 
of decreasing mean spike rate, (a) The pairwise connected 
correlations, Cij = (<Xi<7j) — (°"i)( (J j}; note that we see both 
positive and negative correlations, and that these correlations 
are small, (b) The interactions Jij in the Ising model; note 
that interactions spread more uniformly through the network 
than the correlations, (c) The mean "magnetization" of the 
individual cells, (d) The "magnetic fields" that express the 
intrinsic tendency of the neurons toward spiking or silence, 
(e) The histogram of correlations, (f) The distribution of 
interactions inferred for sub-networks of different size. 



neuron can either fire or be silent. The effective number 
of independent samples is smaller because of correlations 
across time; using bootstrap error analysis we estimate 
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7-10 [25]. Under these conditions, pairs of cells 
j 200 /im of each other have correlations drawn 
from a homogeneous distribution; the correlations decline 
at larger distance 48J. This correlated patch contains 
N ~ 200 neurons [3D], of which we record from N — 40 
[I]. The values of {<Ji) and Cy for this population of cells 
are shown in Fig[T]i and[l]:, respectively. 

The central problem is to find the magnetic fields and 
exchange interactions that reproduce the observed pair- 
wise correlations, ft is convenient to think of this prob- 
lem more generally: We have a set of operators M ({ci}) 
(= {<7i,<7i<7j} for the pairwise model) on the state of the 
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our problem is to find the couplings g (= {hi, Jy} for 
the pairwise model) that generate the correct expectation 
values, which is equivalent to solving the equations 
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Up to N ~ 20 cells we can solve exactly, but this ap- 
proach does not scale to N — 40 and beyond because 
the partition sum Z contains a number of terms that 
is exponential in N. For larger systems, this "inverse 
Ising problem" or Boltzmann machine learning, as it is 
known in computer science [49j , is a hard computational 
problem that has to be solved by approximate schemes 

P3HE1EI]. 

Given a set of coupling constants g, we can estimate 
the expectation values (0^) s by Monte Carlo simulation. 
Increasing the coupling g^ will increase the expectation 
value (Ofj), so a plausible algorithm for learning g is to 
increase each g^ in proportion to the deviation of (O^) 
(as estimated by Monte Carlo) from its target value (as 
estimated from experiment). This is not a true gradi- 
ent ascent, since changing g^ has an impact on operators 
(Ov^p), but such an iteration scheme does have the cor- 
rect fixed points; heuristic improvements include a slow- 
ing of the learning rate with time and the addition of 
some 'inertia', so that we update g^ according to 



Ag^t+1) = - V (t) (6 M ) g(t) - <0„) 



)ex P t +aAg^(t), 
(18) 

where rj(t) is the time-dependent learning rate and a 
measures the strength of the inertial term. 

We used different variations on this basic strategy for 
networks of different size. As noted above, for < 20 we 
solve exactly using a custom implementation of algorithm 
described in Ref [50]; this involves fully evaluating the 
partition sum. For A^ = 40, we first obtained a good ini- 
tial approximation for g by contrastive divergence Monte 
Carlo 1511 . Then we followed up until convergence by 



using Eq (18) directly and decreased the learning rate 
7](t) as Oiljt) or slower according to a custom sched- 
ule, and kept the inertia a at zero. Later in the paper 
we consider synthetic networks with A^ = 120, and for 
these we use a = 0.9. Learning the couplings g was 
slow for N — 120 synthetic networks, but eventually con- 
verged; even in this case Cy converged to within 10% for 
the largest quartilc of elements by absolute value, and 
within 15% for the largest half, without obvious system- 
atic biases. In addition to these algorithmic issues, the 
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Hamiltonian was rewritten such that Jy was constrain- 
ing (cr ; - ((Ti) eX pt)(crj - (o-j)cxpt), and we found that this 
removed biases in the reconstructed covariances. 

Figure [2] shows the success of the learning algorithm by 
comparing the measured pairwise correlations to those 
computed from the inferred Ising model for 40 neurons. 
The reconstructions of the means, (cr;), are not shown, 
but are accurate to better than 1%. 

The pairwise maximum entropy model has many fewer 
parameters (820) than there are possible states of the 
network (~ 10 12 ); more importantly, the effective num- 
ber of independent sample in our data set is ~ 7 x 10 4 , 
and this also is much larger than the number of parame- 
ters. Further, the parameters of the model are functions 
of the mean firing rates and pairwise correlations. Thus, 
to the extent that we can measure these quantities re- 
liably, there is no issue of having a model which is too 
complex to be supported by the data. Nonetheless, one 
would like an independent test to show that we are not 
'over-fitting' to a limited data set. To do this, we divide 
the available data in half to create separate training and 
testing data. We then use the model learned from the 
training data to compute the average log-likelihood of 
the data in each half, and measure the difference 
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where k indexes one random split of the data into train- 
ing and test halves, and gk is the set of parameters that 
we learn from the training data on that split. It should be 
noted that in the computation of A, the partition func- 
tion Z(gfe) cancels, so really we are computing the differ- 
ence in the mean "energy" between the training and test 
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FIG. 2: a) Precision of the Ising model for N = 40 neurons 
learned via Eq \18\ : measured covariances are binned on the 
x-axis and plotted against the corresponding reconstructed 
covariances on the y-axis; vertical error bars are the standard 
deviation within the bin, horizontal error bars are bootstrap 
errors on covariance estimates, b) Zoom-in for small (7y . The 
scale bar represents the standard deviation of Cy in shuffled 
data, where all observed correlations arise from insufficient 
sampling. 



data. With twenty random choices for the training/test 
split, we find that A = 0.0078 ± 0.0086. Thus A has 
enough variance to consistently include zero with high 
probability, implying that over-fitting is not a problem. 



V. DOES THE MODEL 'WORK' ? 

For N — 10 or even TV = 20, we can compare the 
predictions of the maximum entropy model with the 
observed frequencies of the network states {ci}. For 
N = 40, this is no longer possible, since the number of 
states is much larger than the number of samples avail- 
able in the experiment. What we can do is to compute, 
in the maximum entropy model, various statistics be- 
yond the pairwise correlations, and check these quanti- 
ties against estimates from the experimental data. In Fig 
[3] we show two examples of this approach. 

First we calculate the probability that K of the N 
neurons spike in the same time bin. The data show that 
P{K) is an approximately exponential distribution, enor- 
mously far from the roughly binomial distribution ex- 
pected if the neurons were spiking independently. The 
exponential structure of P{K) is well reproduced by the 
predictions from the maximum entropy model; the max- 
imum entropy model has a tail which is slightly too 
heavy. In more detail, the Ising model underestimates 
the probability of the no-spike pattern by a few percent, 
P cxpt (K = 0) = 0.550 vs. Pi slng (K = 0) = 0.502. The 
same deviation is already observed at N — 20, where 
P cxpt (0) = 0.621 vs. Pi si „ g (0) = 0.599, and since at this 
network size we reconstruct the Ising model using an ex- 
act algorithm, the deviations at N = 40 are not due 
to limitations of our Monte Carlo algorithm. Because 
the no-spike pattern is sampled well in our dataset, this 
deviation is significant and indicates that higher order 
interactions are starting to have an effect, albeit a small 
one. 

As a second test, we consider the correlations among 
triplets of neurons. We look for the correlated deviations 
of triplets of neurons from their mean firing rates and 
define the connected third-order correlation coefficients, 
(CTi<7jf7k)( c ) = ((cri-((Ti))(<Tj-(crj))(cr k -(f7 k ))). Of course 
the independent model predicts that these will be zero, 
whereas about a third of the (40)(39)(38)/(3!) - 10 4 
distinct triplets of neurons have correlations which are 
significantly different from zero. The maximum entropy 
model certainly captures the trend of these correlations, 
although it does tend to overestimate the large ones 
slightly, by ~ 7% on average. It is worth recalling that 
these ~ 10 4 triplet correlations are being predicted from 
a model that is determined only by the 820 measured 
means and pairwise correlations, with no additional fit- 
ting allowed. In this context, 7% errors are small, and 
are detectable only because we have tens of thousands of 
samples. 

The maximum entropy formalism is a series of ever bet- 
ter approximations to the true distribution P OX pt({0i}), 



7 



-data 
-Ising 

-independent 




0.02 0.04 0.06 



FIG. 3: a) Probability of observing K simultaneous spikes 
(blue - data, red - Ising model), compared to the failure of 
the independent model, where = (black). Dashed lines 
show error estimates, b) Measured vs predicted connected 
three-point correlations for 40 neurons (red) and for an exact 
Ising model for a 20 neuron subset (black). Horizontal error 
bars are bootstrap estimates from data, vertical error bars are 
computed across 20 Monte Carlo runs. 



in which the approximations match the data to increasing 
correlation orders. There is no a priori reason why pair- 
wise order alone should suffice, but our results show that 
even at TV = 40 the Ising model performs exceedingly 
well, closing most of the gap between the independent 
model and the real data. The correct view of this result 
is not that pairwise model is exactly correct, since clearly 
it isn't, but rather that this model is capturing a large 
fraction of the correlation structure in the data, even the 
higher order structure. 

Given that the pairwise model does work very well, 
there are at least three questions we could ask. One is 
to address the small departures from the model, either 
by adding explicit higher order interactions, or by trying 
to infer the existence of 'hidden' elements that generate 
effective multi-neuron interactions even if the dominant 
elementary interactions are among pairs. This is a ques- 
tion that we leave for future work. The next question is 
to try and understand why the model works as well as 
it does. Finally, given that the model works, we can ask 
what it teaches us about the network as a whole, and 
about the neural code in the retina. We take up these 
two questions in the next sections. 



VI. EXPLORING SUBNETWORKS 

In the current experimental setup, simultaneous 
recordings can only be made on a subset of neurons from 
a patch of the retina with highly correlated firing activity. 
It is thus relevant to ask about the effects of unobserved 
or 'hidden' nodes on the validity of the pairwise model. 
To explore this issue with the data we do have, we can 
'hide' some of the cells in the TV = 40 network to which 
we have access, and ask how this changes the quality of 
the maximum entropy description. Intuitively, it is sur- 
prising that pairwise models work well both on TV = 40 



neurons and networks consisting of subsets of these neu- 
rons, as in the original Ref [4]. We would expect that not 
observing a x will induce a triplet interaction J Q . ( g 7 among 
neurons {er Q , erg, 07} for any triplet in which there were 
pairwise couplings between a x and all triplet members. 
Additionally, by comparing the parameters in the full 
network, g^ 40 \ with their corresponding averages from 
different subnets of size TV = 20, g*- 20 - 1 , we find that the 
couplings J;j are left almost unchanged, while magnetic 
fields hi change substantially. 

To explain both phenomena, we examine the flow of 
the couplings under decimation. Let us start by includ- 
ing a three-body interaction term JjikCiOjCTk m to the 
TV-neuron Hamiltonian #) of Eq Q for pW({<7i}). 
Then, we isolate terms related to spin ctn, and marginal- 
ize over on to obtain p( N ~ ^({eri}). This summation 
results in 

P^-^W) cx e-"^ 1 ' x (20) 

^Iog(2 cosh [^N+^. JiNCTi + | JijNCiCTj] ) 

Because of the last term, the resulting expression for 
p(N-i) c i ear iy does not have the same form as that for 
p( N ) . We can, however, expand around on = — 1 up to 
terms of order 0((1 + a) 4 ) as long as J^i JiN^i + 1), an d 
the similar term for triplets, are small; we then look for a 
decimated Hamiltonian, TL^^ 1 ^, with renormalized cou- 
plings and a matching power series. This results in the 
following flow of coupling when a single neuron <tn has 
been marginalized over: 



h 



+ Y^j As 

J ijk + 0( 7 , 6) 



(21) 
(22) 
(23) 



where J iN = JiN - 53j -AiNi Aj = ^in^n(1 - + ^ Jjn 
and <jj — tanh(/iN — J2i ^iN + \ Xaj ■Ajn)- The terms 
7, S cx (1 — to 2 ) originate from terms with 3 and 4 factors 
of <7, respectively. The key point is that neurons spike 
very infrequently (on average in ~ 2.4% of the bins) and 
so (<7j) « —1, in which case ui is approximately the hy- 
perbolic tangent of the mean field at site N and is close 
to —1. If pairwise Ising is a good model at size TV, and 
couplings are small enough to permit expansion, then the 
corrections to pairwise terms after decimation, as well as 
Jijk, are multiplied by 1 — oj 2 <C 1. Sparsity of spikes 
therefore keeps the complexity in check: when a neu- 
ron cannot be observed, the correction to the pairwise 
couplings among the 'visible' neurons, as well as the ap- 
pearance of higher order interactions between them, are 
suppressed. 

We test these ideas by selecting 100 random subgroups 
of 10 neurons out of 20; for each, we compute the exact 
pairwise Ising model from the data, as well as applying 
Eqs ( 21 - 23 ) , with full expressions for 7 and 6, 10 times in 



succession to decimate the network from 20 cells down to 
the chosen 10. The resulting three-body interactions Jijk 
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have a mean and standard deviation ten times smaller 
than the pairwise Jy. If we ignore these terms, the aver- 
age Jensen-Shannon divergence [521 between this prob- 
ability distribution and the best pairwise model for the 
N = 10 subgroups is D JS = 9.3±5.4 x lCT 4 bits, smaller 
than the average divergence between either model and 
the experimental data, and means that S> 10 3 samples 
would be required to distinguish reliably between the two 
models. At least for the range of N examined here, the 
decimation calculation provides a useful approximation. 

The sparsity of spikes can explain the dominance of 
pairwise interactions: the higher order terms are not in- 
trinsically small, but the fact that spiking is rare means 
that they do not have much chance to contribute. Thus, 
the nature of the pairwise model is more like a Mayer 
cluster or virial expansion than like simple perturbation 
theory. Of course, with finite N, all quantities must 
be analytic functions of the coupling constants, and so 
we expect that, if carried to sufficiently high order, any 
perturbative scheme will converge — although this con- 
vergence may become much slower at larger N, signaling 
genuinely collective behavior in large networks. There are 
a number of reasons to think that (contrary to the sug- 
gestion in Ref [18] . but consistent with the conclusions 
of Ref [IS]) the real system we are studying is already 
outside the regime in which low order perturbative ap- 
proaches are sufficient. First, the simple relation between 
Jjj and Cjj predicted by the lowest order of perturbation 
theory [TS] [53] is violated, as shown in Fig [ip, and the 
resulting errors in the predicted P({ai}) are significant, 
as shown in Fig [IJj. Second, if we make a perturbative 
expansion of the entropy itself in powers of the measured 
correlations, even carrying this out to fourth order fails 
to work for N > 10 neurons [S3]. Finally, the standard 
deviation of 'effective field' <j>i — ^ Jij^j that represents 
the influence on neuron i by its neighbors is comparable 
to the intrinsic bias h\ for N = 40. 

The question of whether pairwise models remain good 
for N beyond 40, even in the regime when the above 
perturbation arguments break down, cannot be answered 
without further experimental data. We can conclude, 
however, that pairwise models are valid beyond the point 
where pairwise interactions merely represent trivial 
perturbative corrections to the dominant intrinsic biases 
h{, which happens already at N ~ 10. 



VII. NETWORK ENSEMBLES 

In trying to develop a theoretical approach to biological 
systems, there is a tension between the search for univer- 
sals and the need to engage with the details of specific 
systems. In the present context, even if the maximum en- 
tropy models provide a perfect description of the prob- 
ability distribution of network activity, one may worry 
that what we have learned is relevant only to the partic- 
ular 40 neurons we happened to record from in this one 
experiment. How can we generalize from these results? 



One idea is that what we observe in this one experiment 
is typical of what we would find by drawing networks at 
random out of some ensemble of networks. Our goal in 
this section is to identify this ensemble. 

We start our search for meaningful ensembles of net- 
works by characterizing the "thermodynamics" of the 
networks that we have observed. Having constructed a 
maximum entropy model with some parameters g, we 
can take this model seriously as a statistical mechanics 
problem and ask what happens as we change the "tem- 
perature," which is equivalent to scaling all of the cou- 
pling g — > g/T. Notice that this is just one slice through 
the large space of parameters, but it is one which has 
a physical interpretation even though the temperature 
is not itself physical. In particular we can define the en- 
tropy, heat capacity, and other thermodynamic variables. 

To proceed, we first plot the dependence of heat capac- 
ity on temperature at various system sizes N in Fig [5] 
The behavior of the heat capacity as a function of temper- 
ature, C(T,N), is diagnostic for the underlying density 
of states, and describes how the states of the system are 
populated as the temperature increases. Two systems of 
the same size with similar C(T, N) curves can thus be 
said to be similar to each other. We recall that the heat 
capacity can be estimated from Monte Carlo simulations 
at a single temperature by computing the variance of the 
energy over many binary patterns generated by the sim- 
ulation, C(T,N) — ((SE) 2 ) /T 2 . 

The first interesting result is that when we choose a 
subnetwork of N neurons at random out of the 40 cells 
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FIG. 4: a) The comparison of couplings Jy for a group of 
N — 5, 10, 15, 20 neurons, computed using the exact maxi- 
mum entropy reconstruction algorithm, with the lowest or- 
der perturbation theory result, Jy = jlogcy, where cy = 
(5'i5j)/((^i)(5j)) and o\ = 0.5(1 + a{) |18l 153] . In the case of 
larger networks, the perturbative Jy deviate more and more 
from equality (black line), b) The exact Ising approximation, 
P({<7i}), can be compared to the distribution 

-Pcxpt 

(W), 

sampled from data; the solid line shows the Jensen-Shannon 
divergence between the two distributions, for four example 
networks of size N = 5, 10, 15, 20. The dashed line shows 
the same comparison in which the Ising model parameters, 
g = {hi, Jij}, were calculated perturbatively. Already at 
N = 10 neurons, the perturbative model has an error two 
orders of magnitude larger than the exact Ising model. The 
point at N = 20 shown in gray because P({ai}) cannot be 
reliably sampled from data. 
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FIG. 5: The heat capacity, C(T,N), for systems of differ- 
ent sizes TV. Ising models were constructed for 400 subnet- 
works of size 5, 180 of size 10, 90 of size 15 and 20, 1 full 
network of size 40 (all from data, black), and 3 synthetic net- 
works of size 120 (green); vertical error bars are standard 
deviations across these examples. For networks of TV < 20, 
the partition function (and therefore heat capacity) can be 
computed exactly; for TV > 20, we estimate heat capacity 
from the variance in energy during Monte Carlo sampling, 
i.e. C{T,N) = {{8E) 2 )/T 2 . The mean and 1-sigma enve- 
lope for C(T, TV) of Ising models of randomized networks at 
TV = 20 are shown in blue dashed lines. We note that the 
peak of the heat capacity moves towards the operating point 
at T = 1 with increasing size. 



from which we record, for TV = 5, 10, 15, or 20, we find 
that the fluctuations in C(T, TV) are very small. This 
suggests that the individual networks, even when quite 
small, are typical of the ensemble of subnetworks that we 
can choose out of this small patch in the retina. 

More ambitiously, we can try to construct an artificial 
ensemble of networks that reproduces the distribution of 
mean spike probabilities and the distribution of pairwise 
correlations that we see in the experiment. Concretely, 
we assign to each neuron a mean spike probability cho- 
sen at random from the observed distribution of mean 
spike probabilities (Fig Hp shows, implicitly, the cumula- 
tive distribution of (&{)), and we assign to each pair of 
cells a correlation chosen at random from the observed 
distribution of correlations, shown in Fig [l^. Note that 
not all combinations of means and correlations are possi- 
ble for binary variables; after each draw from the distri- 
bution of (<7i) and Cij , we check that all 2 x 2 marginal dis- 
tributions are consistent, and repeat the draw if needed. 
Once the whole synthetic covariance matrix is generated, 
we check (e.g. using the Kolmogorov-Smirnov test) that 
the distribution of its elements is consistent with the mea- 



sured distribution. Figure [5] shows C(T, TV) for networks 
of 20 neurons constructed in this way, and we see that, 
within error bars, the behavior of these randomly chosen 
systems resembles that of real 20 neuron groups in the 
retina. 

We emphasize that ensemble we construct here is very 
different from the usual one in statistical mechanics. In 
the theory of spin glasses [55] . it is the interactions 
which are chosen at random. Importantly, each pairwise 
interaction is chosen independently. If we try to do this in 
our problem, we find a disaster — when we randomize the 
fields hi and interactions , we find that distribution of 
mean spike probabilities (<7j) changes dramatically, and 
as a result everything else about the network also changes 
(heat capacity, entropy, ... ). In contrast, here we keep 
the distribution of observables, i.e. the pairwise corre- 
lations and firing rates, fixed at their measured values, 
and moreover independent of TV, as motivated by exper- 
iments that find no decay in the pairwise correlation in 
patches of ~ 200 neurons. 

A striking feature of the spin glass problem is that 
when the Jy are drawn independently and at random, 
the correlations among the spins have a rich, hierarchi- 
cal structure [55] . In our construction method we draw 
the correlation matrix elements independently (subject 
to the consistency condition above) , and therefore expect 
that this will induce a complicated correlation structure 
in the space of Jy couplings. 

We are building a spin glass model in which all pairs 
potentially interact, and all the pairwise correlations are 
drawn from the same distribution. In this sense, we ex- 
pect that we have some sort of mean-field model, but the 
correlations have a scale set by experiment, and hence 
cannot be reduced as TV becomes larger. This family 
of models clearly cannot have a normal thermodynamic 
limit as TV becomes large. This is not a failing of the 
model, however: the real neural system has correlations 
that do not appreciably decay with distance in meso- 
scopic patches (presumably because of the extended con- 
nectivity of the neurons and correlations in the stimu- 
lus), and maximum entropy models can explore the con- 
sequences of these measured constraints. Recalling our 



results on the toy model in Section III we might expect 
our systems to be driven towards criticality as TV is in- 
creased; we explore this issue in the next section. 



VIII. LARGER NETWORKS AND 
CRITICALITY 

Figure [5] reveals an interesting behavior of heat ca- 
pacity C(T, TV): as the size of the system increases, the 
peak of the heat capacity moves closer and closer to the 
operating point at T = 1 , in networks of size TV < 40 con- 
structed from data. Armed with the results at TV = 20 
and an operational definition of a network ensemble, we 
decided to check if this behavior continues to larger TV. 
We thus generated several synthetic networks of 120 neu- 
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rons by randomly choosing once more out of the distri- 
bution of (<Ti) and Cy observed experimentally. The heat 
capacity C(T,N = 120) now has a dramatic peak at 
T* = 1.07 ± 0.02, very close to the operating point at 
T=l. 

To be clear, we note that the shift in the peak of heat 
capacity with the system size TV is a direct consequence of 
pairwise couplings in the model, and hence this structure 
is driven by the measured correlation among neurons. In 
independent models, a simple calculation shows that 



A' 



C ind (T,N) = J2 ( h -J T ) 2 [1 ~ tanh 2 (VT)] 



(24) 



If the brain is interested directly in how surprised it 
should be by the current state of its inputs, then it might 
be important to maximize the dynamic range for instan- 
taneously representing this (negative log) likelihood, that 
is, to have at its disposal a codebook with word frequen- 
cies whose range is wide enough to encompass the range 
of probabilities of sensory events. In contrast, simple no- 
tions of efficient coding would require all symbols to be 
used with equal probability. But since states of the visual 
world occur with wildly varying probabilities, achieving 
this notion of efficiency would require block codes that 
are extended over time and arc therefore slow. 



where hi = tanh ((o"i)); for our dataset, the ensemble 
average of C[ n d(T,N) for independent models peaks at 
about T ~ 1.7 regardless of TV, and indeed, when nor- 
malized by TV, the curves of Ci n( j(7 1 , TV) /TV collapse onto 
a single curve. In contrast, in pairwise models of the 
same data, the peak moves from T ~ 1.7 for TV = 5 to 
T ~ 1.3 for TV — 40, while the heat capacity per neu- 
ron, C(T, TV)/TV increases by about 40%. This is yet an- 
other indication that, although the correlation between 
any two neurons is weak, neglecting these correlations 
gives a qualitatively wrong picture of the states accessi- 
ble to the network as a whole. 

In physical systems, a sharp peak in the heat capacity, 
becoming singular as TV becomes large, is associated with 
a critical point in the phase diagram. The critical point 
is distinguished by the fact that at this point, the system 
is maximally sensitive to small changes in parameters. It 
has been suggested that this sensitivity makes operation 
at a critical point attractive as a strategy for biological 
sensory and signaling systems [55) |571 |58]. Behavior in 
the neighborhood of the critical point also is universal, so 
that systems with many different microscopic structures 
can generate, quantitatively, the same critical behavior. 
It should be noted that there are different notions of crit- 
icality which have been applied to biological systems. As 
far as we know this is the first evidence for criticality of 
a biological network in the thermodynamic sense. 

From the thermodynamic point of view, the critical 
point marks a boundary between phases. But from a 
statistical point of view, this point is at an extremum. 
Specifically, in a large system we expect that almost all 
states which are accessible will have nearly equal prob- 
ability; in information theory this is the idea of "typi- 
cality" (usually applied to long sequences), and in sta- 
tistical mechanics this is the equivalence of the canoni- 
cal and microcanonical ensembles. The critical point is 
the place where the corrections to this expectation are 
largest. More precisely, if we write the probability distri- 
bution in the Boltzmann form, as in Eq (JTJ, the (nega- 
tive) log of the probability of visiting a state is the energy, 
and the heat capacity is proportional to the variance of 
the energy. Thus, at the critical point, where the heat 
capacity has its maximal value, the variance of log prob- 
ability is largest. 



IX. ENTROPY AND MULTI-INFORMATION 



Scaling of the temperature introduced in Section |VII| 
is useful for another reason: it enables us to estimate the 
entropy of P({eri}). The entropy in the context of coding 
measures the capacity of the neurons to convey infor- 
mation about the visual world: the single-neuron biases 
and interactions effectively reduce the total number of 
likely binary patterns (or the codebook size) from 2 N to 

2 S(T=1,N) and 

we would like to quantify this decrease. 

We recall that 



S(TV) = S(T = l,N) 



C(T, TV) 
T 



dT, 



(25) 



where the heat capacity C(T, TV) can be estimated by 
Monte Carlo from the variance of the energy, C(T, TV) = 
((SE) 2 }/T 2 , by drawing a large number of spin configu- 
rations at a fixed temperature T, computing their energy 
using the Hamiltonian of Eq ([lj and taking the variance. 
Then, the integral in Eq ( p5| is performed up to T = 1, 
corresponding to the pairwise model of the real data. Es- 
timating the entropy using the heat capacity integration 
in Eq (251 is crucial for TV > 20, because direct estima- 



tion from raw Monte Carlo samples becomes infeasiblc. 



If we integrate Eq ( 25 1 to find the entropy of the large 



synthetic networks, we find that the independent entropy 
of the individual spins, So (120) = 17. 8±0. 2 bits, has been 
reduced to S(120) = 10.7 ± 0.2 bits by pairwise interac- 
tions. As Fig [6] shows, even at TV = 120 the entropy 
deficit or multi-information /(TV) = S (TV) - S(TV) for 
the Ising models continues to grow in proportion to the 
number of pairs (~ TV 2 ), continuing the pattern found 
in smaller networks [I]. We also note that Ising mod- 
els provide a lower bound on the total amount of multi- 
information in the real distribution, because they only 
capture the pairwise structure. Therefore, if higher-order 
couplings become important at larger TV, the real /(TV) 
might be bigger that the one estimated in Fig|6] The next 
generation of experiments [32], probing the TV ~ 100—200 
regime, will provide a decisive test of the maximum en- 
tropy model predictions. 
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FIG. 6: The scaling of multi-information / and the indepen- 
dent entropy So with the system size. Points at network size 
N — 5, 10, 15, 20, 40 are computed from real data, with error 
bars denoting the spread across many subnetworks selected 
from N — 40 (there is no error estimate for the single network 
at TV = 40). Dashed lines are scaling fits; independent en- 
tropy scales approximately linearly, and multi-approximation 
approximately quadratically, with system size N, and this 
scaling continues up to the three synthetic networks of size 
N = 120. 
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FIG. 7: (a) Probability that the 40 neuron system is found 
in a configuration within the basin of attraction of each non- 
trivial locally stable state Q a , as a function of time during 
the stimulus movie; P(G a \t) = 0.4 means that the retina is in 
that basin on 40% of the 145 repetitions of the movie, (b) All 
unique patterns assigned to <?5 at t = 10.88 — 10.92 s. (c) Zipf 
plot of the rank ordered frequencies with which the lowest 
lying 5 • 10 4 locally stable states are found in the simulated 



120 neuron system. 



X. LOCALLY STABLE STATES 

In the Hopfield model, dynamics of the neural net- 
work corresponds to motion on an energy surface. Sim- 
ple learning rules can sculpt the energy surface to gen- 
erate multiple local minima, or attractors, into which 
the system can settle. These local minima can represent 
stored memories, or the solutions to various computa- 
tional problems [59j [60] . By analogy with spin glasses, 
we can think of these multiple attractors, or locally stable 
states, as resulting from the competition between posi- 
tive and negative interactions. In our maximum entropy 
models, we similarly find a range of values encompass- 
ing both signs, as shown in Figs [TJd and [if. We would 
like to understand whether this structure leads to multi- 
ple attractors, and what this means for the nature of the 
neural code. 

Locally stable states are patterns of activity Q a = {ct;}, 
such that a flip of any single spin in Q a increases the en- 
ergy (or decreases the likelihood) of the new state. At 
N = 40 we find 4 such local energy minima (Q 2 , ■ ■ ■ , G5) 
in the observed sample that are stable against single spin 
flips, in addition to the silent state Q\ {a\ — —1 for all i). 
Using zero-temperature Monte Carlo, each configuration 
observed in the experimental data is assigned to its cor- 
responding stable state: one starts with a binary pattern 



observed in the data, and flips the spins as long as each 
spin flip decreases the energy under the reconstructed 
Ising Hamiltonian TL. The local energy minimum thus 
found is matched to one of the Q a . Although this as- 
signment makes no reference to the visual stimulus, the 
collective states Q a are reproducible across multiple pre- 
sentations of the same movie (Fig |7^), even when the 
microscopic state {ci} varies substantially (Fig [7)3). 

In a simulated network of N — 120 neurons, we find a 
much richer energy landscape. Looking in detail at the 
distribution of , we find that it is approximately Gaus- 
sian J = -0.016 ± 0.004 and er, 7 = 0.61 ± 0.04, with 53% 
of triangles being frustrated (46% at N = 40), indicating 
the possibility of many single-spin-flip stable states, as 
in spin glasses [55]. For each N — 120 network, we thus 
generated one long run collecting 2-10 7 independent sam- 
ples. For each sample zero-temperature Monte Carlo is 
again used to determine the appropriate basin of attrac- 
tion; we tracked 5 • 10 4 lowest metastable states and kept 
detailed statistics for 10 3 lowest. We conclude that the 
Gibbs state now is a superposition of thousands of Q a , 
with a nearly Zipf-like distribution (Fig[7j;). We checked 
that the probability of the Monte Carlo to generate a spin 
configuration that belongs to basin of attraction Q a is to 
a good approximation proportional to e~ F °, where the 
free energy F a = (E) a — S a is the difference between the 
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average energy in the basin of attraction and its entropy, 
both estimated from the simulation data (i.e., from bi- 
nary patterns that belong to the basin Q a ). The average 
escape barrier size for the studied meta-stable states is 
~ 5, in natural units used in Eq ([lj. 

The entropy of the distribution of metastable states, 
P(G a ) is 3.4 ±0.3 bits, about a third of the total entropy. 
Thus, a substantial fraction of the network's capacity to 
convey visual information would be carried by the collec- 
tive state, that is, by the identity of the basin of attrac- 
tion, rather than by the detailed microscopic states. 

Based on these observations, we can formulate the fol- 
lowing hypothesis: Trial-to-trial variability demonstrates 
that neurons must be to some extent noisy and therefore 
each 'microscopic' state or binary pattern {ai} cannot 
be a codeword with a separate meaning; instead, the 
space of binary patterns must be partitioned into do- 
mains or regions containing similar patterns with syn- 
onymous meanings. We propose that such domains are 
exactly the basins of attraction of locally stable states of 
H. One desirable property of this choice is that the simi- 
larity metric for microscopic states is the energy function 
(or the likelihood) itself, which captures our intuition 
that the most frequently observed binary word among 
words that differ in a single letter is probably the 'cor- 
rectly spelled' variant, and the other variations are akin 
to spelling mistakes. The second desirable characteris- 
tic is that an associative and error correcting mechanism 
for parsing such words already exists, and is simply the 
Hopfield network. We also find that on the synthetic 
network of N = 120 neurons, observing K <C N neurons 
is on average approximately two times more informative 
about the collective state Q a than about an unobserved 
neuron, i.e. I(a;Q a ) / 'S(Q a ) ~ 21 (a; ai)/S{<r{), where a 
is a group of K < 10 neurons that doesn't include cr/, and 
the information is normalized by the corresponding total 
entropy. This indicates that the decoding of locally sta- 
ble states from a neural sub-population should be easier 
than decoding the microscopic pattern of activity. 

To test this hypothesis, one would need to compare the 
information that basins of attraction provide about the 
stimulus s, I(Q a \s), with the information that the mi- 
croscopic state provides about the stimulus, I{{a{\\s). 
If these two quantities are similar in size, then basin-of- 
attraction identity Q a is a good summary (or compres- 
sion) of the microscopic state, and can be used to trans- 
mit information. Experiments could also be used to refine 
the hypothesis by checking if, for example, the basin of 
attraction provides information about some identifiable 
invariant property of the stimulus (e.g., spatial shape), 
while additional information about other aspects of the 
stimulus (e.g., contrast) is encoded in the microscopic 
pattern within the basin of attraction. Unfortunately, 
in the real networks of N — 40 neurons we can just 
start to detect the emergence of the metastable states, 
and we are thus unable to perform the test. While at 
N = 120 neurons the structure is much richer, these syn- 
thetic networks have no associated set of 'experimentally 



observed' patters locked to the stimulus, so such a test, 
even simulated, can also not be performed. To conclude 
this section, we note that locally stable states can be de- 
fined without a reference to a particular model (the Ising 
model here) , by simply finding patterns that are sampled 
well enough in the data and are more frequent than all 
of their single-spin- flip neighbors |45j . Even if pairwise 
models were to break down at higher N, our suggestion 
might still provide a viable coding hypothesis. 



XI. DISCUSSION 

Ising models, with a spin glass structure of compet- 
ing interactions, are the least structured models that can 
describe the observed mean spike probability and pair- 
wise correlations in networks of real neurons. Remark- 
ably, these minimal models continue to provide a good 
description of the higher order correlations among reti- 
nal neurons up to N — 40. Although correlations among 
pairs of cells are weak, the behavior of these large groups 
of cells is strongly collective [61] . 

In the Ising model, the bias for a particular neuron to 
fire (spin up) or remain silent (spin down) has two com- 
ponents, one intrinsic to the neuron and one from the 
interactions with other neurons. At N = 40, these dif- 
ferent contributions are comparable, and so the pairwise 
models cannot simply be viewed as a small perturbative 
correction to the independent model. Nevertheless, the 
sparsity of spikes is of crucial importance for models of 
neural behavior. In small networks at least, rare spiking 
means that higher order interactions don't have much 
chance to contribute even if they are present, and we 
hypothesize that this property of neural code could be 
important also as the brain tries (literally) to make sense 
of the incoming data. 

Having found that the Ising model provides a good de- 
scription of the real network, we are encouraged to take 
the model seriously as a statistical mechanics problem. In 
particular, since the system has competing interactions 
that differ from neuron to neuron, we would like to un- 
derstand, in the spirit of spin glass theory |55j . whether 
the particular system that we observe is typical of some 
ensemble of networks. We have been able to construct 
such an ensemble, and use this as a tool to predict the 
behavior of larger systems. In the salamander retina, the 
"correlated patch" of neurons, within which the pairwise 
correlations are largely independent of the distance be- 
tween cells, contains N ~ 200 cells, so we would like to 
understand what our framework predicts on this scale. 
Remarkably, the larger networks that we construct seem 
to be operating close to a critical point in their phase 
diagram. 

Criticality is perhaps the most dramatic signature of 
collective behavior. The next generation of experiments 
should have access to the full population of N ~ 200 
cells |30l |3~T1 13"2"] , and will test this prediction in detail. 
As emphasized in recent work on natural images, one can 
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find evidence of criticality even without constructing an 
explicit statistical model (45]. Further, it should be re- 
membered that the pattern of correlations in the retina 
depends not just on the underlying connectivity or the 
structure of the visual inputs, but also on the adaptation 
state of the system. While much has been learned about 
the way in which individual neurons adapt to changes 
in the input stimulus statistics, especially in the retina 
[fJ21 153"] , much less is known about how these adaptation 
processes influence the behavior of whole networks. If 
operation at a critical point is an essential feature of net- 
work function, adaptation after a sudden change in input 
statistics should bring the system back to criticality, even 
if the exact pattern of spiking probability and correla- 
tions in the network is changed, and this is testable. For 
a model of how adaptive dynamics can enforce critical 
behavior, see Ref [64] . 

Our second prediction concerns the emergence and role 
of the locally stable states in the probability distribu- 
tion of binary words, P({cti}): we have shown that at 
N = 40 several locally stable states appear reproducibly 
from trial to trial despite substantial variability in the 
detailed binary patterns of neural activity. Importantly, 
the analysis which identifies these states makes no refer- 
ence to the visual stimulus, yet these states are tied to 
the stimulus in a way which suggests a role in the neural 
code. Furthermore, by studying the synthetic networks 
we have demonstrated that the vocabulary of such states 
could vastly expand by the time N reaches ^100, provid- 
ing enough capacity and dynamic range even to dominate 



the encoding of incoming stimuli. Such a collective code 
is not inconsistent with the single-neuron results obtained 
to date; rather, one possible manifestation of collective 
coding is that, as more and more neurons are observed 
simultaneously, what was thought to be noise or uncor- 
rclated fluctuations in a single neuron, is really a part of 
the collective state. 

Both ideas — tuning to criticality and the use of sta- 
ble states as robust codewords — have been motivated by 
experiments on a 40-neuron network where the begin- 
nings of nontrivial collective behavior could be identified. 
We have outlined how these predictions can plausibly be 
tested in a next generation of experiments that will access 
networks of N ~ 100 or larger. If validated, these and 
similar results would provide a substantive link from real 
data to to the large body of theoretical work on neural 
networks. 
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